April 10, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)
Success:
Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!
Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!
Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?
All methods we’ll talk about can be sampled!
VAEs approach this problem assuming the following form for an input, \(\mathbf x_i\) (of arbitrary form):
\[ P(\mathbf x_i | \mathbf z_i) = \mathcal N_P(f(\mathbf z_i), \boldsymbol \Sigma_{x|z}) \]
\[ P(\mathbf z_i) = \mathcal N_K(\mathbf 0 , \mathcal I_K) \]
\[ Q(\mathbf z_i | \mathbf x_i) = \mathcal N_{K}(g(\mathbf x_i), \boldsymbol \Sigma_{z | x}) \]
The goal is to represent the complex input in a low-dimensional latent space that is easy to sample from
Train an encoder model to map \(\mathbf x_i\) to \(\mathbf z_i\)
Train a decoder model to map \(\mathbf z_i\) to \(\hat{\mathbf x_i}\)
Do this in such a way that \(Q(\mathbf z_i | \mathbf x_i)\) is easy to work with
And such that \(\hat{\mathbf x_i} \approx \mathbf x_i\) for all inputs in the training data
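The generative story above can be sketched in a few lines. This is a toy, assuming a stand-in linear decoder `f` and made-up dimensions; in practice \(f\) is a trained neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

K, P = 2, 5          # latent and input dimensions (toy sizes)

def f(z):
    # stand-in decoder: a fixed linear map (a trained NN in practice)
    W = np.arange(K * P).reshape(P, K) / 10.0
    return W @ z

# the generative process assumed by the VAE:
z = rng.standard_normal(K)               # z ~ N_K(0, I_K)
x = f(z) + 0.1 * rng.standard_normal(P)  # x | z ~ N_P(f(z), sigma^2 I)
```

Everything about the model other than the shapes of these two sampling steps lives inside the encoder and decoder networks.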
A nonlinear, generative generalization of the basic PCA model!
In theory, a well tuned latent space can be used to generate any image that is encompassed by the training set!
Sample a new value from the prior - \(P(\mathbf z_i)\)
Use the decoder to translate the latent value to an image!
Works way better than deterministic autoencoders for generating new images that were not seen in the training data
VAEs are conceptually pretty simple
The difficult part is training!
Assume that \(f(\mathbf z_i , \phi)\) is an arbitrarily complex function of many parameters (think NNs) that maps the latent variable to the input space.
We would like to find \(\phi\) that maximizes the log-likelihood of our input data:
\[ \hat{\boldsymbol \phi} = \underset{\phi}{\text{argmax }} \expn \log P(\mathbf x_i | \boldsymbol \phi) \]
Since the latent variables are learned, we want to marginalize them out of our likelihood!
Under our VAE assumption:
\[ \log P(\mathbf x_i | \boldsymbol \phi) = \log \int \mathcal N_P(\mathbf x_i | f(\mathbf z_i , \boldsymbol \phi) , \boldsymbol \Sigma_{x|z}) \mathcal N_K(\mathbf z_i | \mathbf 0 , \mathcal I_K) d \mathbf z_i \]
This isn’t tractable
I’m reviewing this pretty detailed approach because we’ll see it again for diffusion models
Using Bayes’ rule and some algebra, we know that:
\[ P(\mathbf x_i | \boldsymbol \phi) = \frac{P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)P(\mathbf z_i)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \phi)} \]
If we knew the conditional posterior for the latent variable given the input data, we could solve this directly!
Since the mapping to \(\mathbf z_i\) from \(\mathbf x_i\) should be suitably nonlinear, we won’t get to know this distribution!
Instead, approximate \(P(\mathbf z_i | \mathbf x_i , \boldsymbol \phi)\) with another distribution of the same dimensionality (usually multivariate normal with diagonal covariance):
\[ P(\mathbf z_i | \mathbf x_i , \boldsymbol \phi) \approx Q(\mathbf z_i | \mathbf x_i , \boldsymbol \theta) \]
Using some clever algebra and properties of expectations, we can show that our optimand (a new word I’m coining) is:
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) + D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i | \mathbf x_i)) \]
The first term is the expected log-likelihood of the input, where the expectation is taken over the approximate conditional posterior on the latent variable
The second term is the KL divergence between the approximation and the prior
The third term is the KL divergence between the approximation and the true conditional posterior
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) + D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i | \mathbf x_i)) \]
In a few words, what is the KL divergence between two distributions?
Which terms above can we learn and which ones can we not learn?
Do we know anything about the value of the unknown(s)?
The evidence lower bound:
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]
can be maximized and used to train VAEs!
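Both ELBO terms have closed forms under the Gaussian assumptions above. A minimal sketch, assuming a diagonal-covariance \(Q\) (parameterized by `mu` and `log_var`) and a fixed-variance Gaussian likelihood:

```python
import numpy as np

def kl_to_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) -- the prior penalty."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def gaussian_recon_loglik(x, x_hat, sigma2=1.0):
    """log N(x | x_hat, sigma2 * I) -- the reconstruction term."""
    P = x.size
    return -0.5 * (P * np.log(2 * np.pi * sigma2)
                   + np.sum((x - x_hat) ** 2) / sigma2)

def elbo(x, x_hat, mu, log_var):
    return gaussian_recon_loglik(x, x_hat) - kl_to_std_normal(mu, log_var)
```

A quick sanity check: if \(Q\) already equals the prior (`mu = 0`, `log_var = 0`), the KL term is exactly zero and the ELBO reduces to the reconstruction term.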
In the context of image generation:
The first term is the reconstruction error - how far is the approximate image created from the latent space from the true image?
The second term is a prior penalty - regularize the problem and try to keep the latent space as close as possible to the prior!
Likelihood/prior tradeoff to promote good fit to training data vs. keeping the latent distribution tractable!
What can we do with VAEs?
Let’s look at a more realistic image example.
Decent reconstruction, blurry samples!
This makes a lot of sense, though.
The conditional likelihood:
\[ \log P(\mathbf x | \mathbf z) \propto -\frac{1}{2}(\mathbf x - f(\mathbf z))^T \boldsymbol \Sigma^{-1} (\mathbf x - f(\mathbf z)) \]
which is just a scaled squared error
VAEs are learning probability weighted combinations of all of the images in the training set
It makes sense that it would learn some sort of blurry average!
There’s a pretty clever “fix” for this blurring phenomenon, called the \(\beta\)-VAE, that can help sharpen the recoveries
Goal: Maximize the ELBO
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]
The first term is reconstruction error
The second term controls overresponse to the training instances
Goal: Maximize the ELBO
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)] - \beta D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]
The second term is trying to encourage the approximate conditional distribution to be close to the prior
Multiply it by a constant, \(\beta > 0\)
If \(\beta = 0\), minimize reconstruction error directly (deterministic autoencoder)
If \(0 < \beta < 1\), let the training data speak a little more loudly than normal
If \(\beta = 1\), VAE
If \(\beta > 1\), create more prior pull
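The \(\beta\) knob is just a scalar on the KL term. A tiny sketch, with made-up numbers for the two terms:

```python
def beta_vae_loss(recon_loglik, kl, beta):
    """Negative beta-ELBO: reconstruction term plus beta-weighted prior pull."""
    return -recon_loglik + beta * kl

recon, kl = -10.0, 2.0
loss_ae   = beta_vae_loss(recon, kl, beta=0.0)  # deterministic autoencoder limit
loss_vae  = beta_vae_loss(recon, kl, beta=1.0)  # standard VAE
loss_beta = beta_vae_loss(recon, kl, beta=4.0)  # stronger prior pull
```

Sliding \(\beta\) moves the optimum along the likelihood/prior tradeoff described above.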
VAEs can also be used to generate new images
Process for unconditional generation:
Take a draw from the prior - \(\mathbf z^* \sim P(\mathbf z)\)
Pass \(\mathbf z^*\) through the decoder to get a new image
VAEs have somewhat fallen out of style for image generation - more GPUs means more complicated models can be used
Still a great intro to generative models for images!
A solid statistical model that makes a lot of sense in the context of PCA
Solid statistical theory
Easy to learn \(P(\mathbf x)\), sample, and generate
Easy to edit
A hallmark of VAEs is a rich latent representation of the data
In theory, \(\mathbf z\) contains most of the information about the images
At their core, these two images are the same
Let’s suppose that each of these images can be meaningfully represented by latent vectors \(\in \mathbb R^K\)
\[ \mathbf z \text{ ; } \mathbf z' \]
A proposal:
\[ \mathbf z' - \mathbf z = \mathbf q \]
where \(\mathbf q\) corresponds to a latent representation of festive garbage bag-ness
A general thought:
Is the latent space set up in such a way that we can add and subtract latent values to get differences between two different images?
A generic strategy with attribute labelled instances to add or subtract a feature:
Train a VAE on the entire dataset
Encode all images in \(\mathbf Z\)
Find the average latent value for images with the attribute and images without the attribute
Subtract!
What’s left over should sorta correspond to the latent vector for the feature of interest!
Given the attribute vector, we can add it to any image without the attribute and (hopefully) edit the image to have the feature!
Encode an image without the attribute, \(\mathbf z_i\)
Add the attribute, \(\mathbf z_i' = \mathbf z_i + \mathbf q\)
Decode the latent vector to an output image.
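The attribute-vector steps above can be sketched with synthetic latents. The offset of 2 in the first coordinate is a hypothetical stand-in for "where the attribute lives"; real encoded latents would be messier:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 8

# hypothetical encoded latents for images with / without the attribute;
# the attribute is planted in the first latent coordinate for illustration
Z_with = rng.standard_normal((100, K)) + np.array([2.0] + [0.0] * (K - 1))
Z_without = rng.standard_normal((100, K))

# attribute vector q: difference of the group means
q = Z_with.mean(axis=0) - Z_without.mean(axis=0)

# edit: encode an attribute-free image, add q, then decode z_edited
z_i = rng.standard_normal(K)
z_edited = z_i + q
```

Here `q[0]` recovers something close to the planted offset of 2, which is exactly the "subtract!" step paying off.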
Let’s look at an example that works.
Most progress on these tasks has leveraged more advanced versions of VAEs:
Hierarchical VAEs
Vector Quantized VAEs (make the latent space discrete rather than continuous, sort of a combination of K-Means + VAE)
The original DALL-E used a discrete VAE (closely related to VQ-VAE) as a first step for multimodal (read: text + image) image generation
Sometimes, we might want to generate images that have certain features
Instead of putting a hat on an existing image, generate a new image with a hat!
How can we introduce this info into an autoencoder?
Assume we have a training set of images with a coded collection of attributes, \(\mathbf a\).
After the encoder network, we get an unconditional latent representation of the input image
\[ \mathbf x \rightarrow \mathbf z \in \mathbb R^K \]
Easy trick: concatenate the attribute vector in the latent space!
\[ \mathbf x \rightarrow [\mathbf z , \mathbf a] \rightarrow \hat{\mathbf x} \]
The encoder learns how to put \(\mathbf x\) into an unconditional latent space
Then, the decoder learns how to take a latent code with conditioning instructions and produce the upstream images
Image with a Hat to \(\mathbf z\)
Ensure that \([\mathbf z , \mathbf a]\) decodes back to an image with a hat!
Generation: Sample \(\mathbf z\) from the prior and pass it, with the required attributes, to the decoder
Generate a new image!
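The "easy trick" is literally one concatenation. A minimal sketch with made-up sizes for the latent and attribute vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
K, A = 16, 4   # latent size and number of binary attributes (toy values)

z = rng.standard_normal(K)                  # sample from the prior
a = np.array([1.0, 0.0, 0.0, 1.0])          # requested attributes, e.g. "hat"

decoder_input = np.concatenate([z, a])      # the easy trick: [z, a]
```

The decoder just sees a slightly longer vector; it is the training procedure that forces the last \(A\) entries to act as conditioning instructions.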
So far, we’ve talked about two types of generative models.
Autoregressive Models
\[ P(\mathbf x) = \prod \limits_{t = 1}^T P(x_t | x_1,x_2,...,x_{t-1}) \]
Advantages:
Directly compute and maximize \(P(\mathbf x)\)
Generates high quality images due to pixel by pixel generation strategy
Disadvantages:
Very slow to train
Very slow to generate high res images
No explicit latent code
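The autoregressive factorization can be made concrete with a toy model where each conditional only depends on the previous value (a two-state Markov chain, with hypothetical transition probabilities):

```python
from itertools import product

# P(x_1 = 1) and P(x_t = 1 | x_{t-1}) -- made-up numbers for illustration
p_first = 0.6
p_next = {0: 0.3, 1: 0.8}

def seq_prob(x):
    """P(x) = prod_t P(x_t | x_1, ..., x_{t-1}); here each factor needs only x_{t-1}."""
    p = p_first if x[0] == 1 else 1 - p_first
    for prev, cur in zip(x, x[1:]):
        p1 = p_next[prev]
        p *= p1 if cur == 1 else 1 - p1
    return p

# the factorization defines a valid distribution: probabilities sum to 1
total = sum(seq_prob(x) for x in product([0, 1], repeat=4))
```

Pixel-by-pixel image models work the same way, just with a neural network producing each conditional.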
Variational Autoencoders
\[ P(\mathbf x) \ge E_{Q}[\log P(\mathbf x | \mathbf z)] - D_{KL}(Q(\mathbf z | \mathbf x) || P(\mathbf z)) \]
\[ P(\mathbf x) = \int P(\mathbf x | \mathbf z) P(\mathbf z) d\mathbf z \]
Advantages:
Fast image generation
Very rich latent codes
Disadvantages:
Maximizing on a lower bound, not necessarily close to the truth
Generated images often blurry due to averaging behavior
Another approach to this problem is called the generative adversarial network (GAN)
The distinguishing feature of the GAN compared to the other models is that it will function by giving up on explicitly modeling \(P(\mathbf x)\)
However, we’ll still be able to draw high quality samples from \(P(\mathbf x)\) given a set of input data!
The premise:
Introduce a latent variable \(\mathbf z\) with a simple prior, \(P(\mathbf z)\).
Then, sample \(\mathbf z \sim P(\mathbf z)\) and pass to a generator network
\[ \hat{\mathbf x} = g(\mathbf z) \]
where \(g()\) is a sufficiently nonlinear function.
Training:
Given \(\mathbf X\), find \(g()\) and \(\mathbf Z\) that minimize the reconstruction error
\[ \expn \|\mathbf x_i - \hat{\mathbf x}_i\|^2_2 \]
We could do this using PCA or deterministic autoencoders, but that wouldn’t give us a distribution to sample from.
Instead, we need to map inputs to distributions in the latent space and the recovered input space.
For VAEs, we did this by making an assumption about the likelihood of observing the input data:
\[ P(\mathbf x | \mathbf z) = \mathcal N_P(\mathbf x | g(\mathbf z), \boldsymbol \Sigma) \]
This assumption can be kinda restrictive
Leads to blurring
Forces strong smoothness that can reduce the crispness of generated images
Instead, do this in a distribution free way
\[ P(\mathbf x | \mathbf z) = ?? \]
This rules out the marginalization strategy:
\[ P(\mathbf x) = \int P(\mathbf x | \mathbf z)P(\mathbf z) d\mathbf z \]
So, how does this work?
Let \(P(\mathbf x)\) be the true, unseen distribution over the input data and let \(Q(\mathbf x)\) be an approximate mapping of \(\mathbf z\) to the input space of the following form:
\[ \mathbf z' \sim P(\mathbf z) \]
\[ \mathbf x = g(\mathbf z') \]
\[ Q(\mathbf x) = \frac{\partial}{\partial x_1} \cdots \frac{\partial}{\partial x_P} \int_{\{\mathbf z : g(\mathbf z) \le \mathbf x\}} P(\mathbf z) \, d\mathbf z \]
This looks really complicated, but the \(Q(\mathbf x)\) formula is just the expression for transforming a random variable (e.g. \(\mathbf z \to g(\mathbf z)\) )
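As a one-dimensional sanity check, with a monotone increasing \(g\) this reduces to the familiar change-of-variables formula:

```latex
Q(x) = \frac{d}{dx} \int_{\{z : g(z) \le x\}} P(z)\, dz
     = P\!\left(g^{-1}(x)\right) \left| \frac{d}{dx} g^{-1}(x) \right|
```

For example, if \(z \sim \mathcal N(0,1)\) and \(g(z) = e^z\), this recovers the log-normal density - a simple prior pushed through a nonlinearity gives a distinctly non-Gaussian distribution over the input space.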
Thus, our goal is to find some \(g()\) that maps the prior to the input space!
Approaching this in standard ways is largely impossible:
We don’t know what \(P(\mathbf x)\) is
All we get is a set of draws from \(P(\mathbf x)\)
We hope that \(Q(\mathbf x)\) is close to \(P(\mathbf x)\) and can substitute one for the other.
The trick: Let \(P(\mathbf x)\) and \(Q(\mathbf x)\) be different distributions with the same support.
We say that \(P(\mathbf x) = Q(\mathbf x)\) if the two yield the same values for all input values of \(\mathbf x\) in the shared domain.
Or:
\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = 1 \text{ } \forall \text{ } \mathbf x \in \mathbf X \]
Density ratios can be uncovered for arbitrary \(\mathbf x\) across two distributions using a clever trick.
Suppose we have \(N\) samples from \(P(\mathbf x)\) and \(N\) samples from \(Q(\mathbf x)\)
Associate each draw with a binary value that tells us which distribution the sample came from:
\(y_i = 1\) if the draw was taken from \(P(\mathbf x)\) (real data)
\(y_i = 0\) if the draw was taken from \(Q(\mathbf x)\) (fake data)
For arbitrary \(\mathbf x\), we can rewrite the density ratio as:
\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = \frac{p(\mathbf x | y = 1)}{p(\mathbf x | y = 0)} \]
Now, using Bayes’ rule we can rewrite the above conditionals as:
\[ \frac{p(\mathbf x | y = 1)}{p(\mathbf x | y = 0)} = \frac{p(y = 1 | \mathbf x)p(\mathbf x)}{p(y = 1)}\frac{p(y = 0)}{p(y = 0 | \mathbf x)p(\mathbf x)} \]
Cancelling and rearranging:
\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = \frac{p(y = 1 | \mathbf x)}{p(y = 0 | \mathbf x)}\frac{p(y = 0)}{p(y = 1)} \]
Assume that we will always have an equal number of \(y = 0\) and \(y = 1\):
\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = \frac{p(y = 1 | \mathbf x)}{p(y = 0 | \mathbf x)} \]
Finally, since this is a two class problem by construction:
\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = \frac{p(y = 1 | \mathbf x)}{1 - p(y = 1 | \mathbf x)} \]
\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = \frac{p(y = 1 | \mathbf x)}{1 - p(y = 1 | \mathbf x)} \]
Given samples from \(P(\mathbf x)\) and \(Q(\mathbf x)\) of arbitrary density with the same support, we can train a classifier on the samples and compute the density ratio based on that classifier
No assumption needed
Cost: We can never get \(P(\mathbf x)\) directly
Density ratios (or differences between distributions) are equivalent to classification tasks!
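The classifier trick can be verified numerically. A sketch, assuming two known 1-D Gaussians where the true log density ratio is \(\log P(x)/Q(x) = 2x\) (so the ideal classifier is logistic with weight 2), and a tiny hand-rolled logistic regression:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20000

x_real = rng.normal(1.0, 1.0, N)    # draws from P(x) = N(1, 1),  labeled y = 1
x_fake = rng.normal(-1.0, 1.0, N)   # draws from Q(x) = N(-1, 1), labeled y = 0
x = np.concatenate([x_real, x_fake])
y = np.concatenate([np.ones(N), np.zeros(N)])

# tiny logistic regression D(x) = sigmoid(w*x + b), fit by gradient descent
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.5 * np.mean((p - y) * x)
    b -= 0.5 * np.mean(p - y)

# estimated density ratio at any x is then p / (1 - p) = exp(w*x + b)
```

For these two Gaussians, \(\log P(x) - \log Q(x) = 2x\), so the fitted classifier should land near \(w \approx 2, b \approx 0\) - the density ratio pops out without ever writing down \(P\) or \(Q\).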
Consider the generic binary classification task under Bernoulli likelihood:
\[ \boldsymbol \Phi = \underset{\boldsymbol \phi}{\text{argmax }} E[y \log D_{\phi}(\mathbf x) + (1 - y) \log(1 - D_{\phi}(\mathbf x))] \]
where \(D_{\phi}(\mathbf x)\) is the probability that \(\mathbf x\) belongs to class 1.
We can define the maximized likelihood as:
\[ V(P,Q) = \underset{\phi}{\text{max }} E[y \log D_{\phi}(\mathbf x) + (1 - y) \log(1 - D_{\phi}(\mathbf x))] \]
This makes sense - we’re only able to say with some level of confidence that we’re making the correct prediction if there is some difference between the feature vectors between the classes
It goes even deeper than that, though!
\[ V(P,Q) = E[y \log D_{\Phi}(\mathbf x) + (1 - y) \log(1 - D_{\Phi}(\mathbf x))] \]
We assume that all \(\mathbf x | y = 1 \sim P(\mathbf x)\) and \(\mathbf x | y = 0 \sim Q(\mathbf x)\)
Since expectations are linear:
\[ V(P,Q) = \underset{\phi}{\text{max }} \frac{1}{2} E_{P(x)}[\log D_{\phi}(\mathbf x)] + \frac{1}{2} E_{Q(x)}[ \log(1 - D_{\phi}(\mathbf x))] \]
Here, 0 is Q and 1 is P
The one-half factors come from the equal class priors, \(p(y = 0) = p(y = 1) = \frac{1}{2}\)
Given \(P(\mathbf x)\) and \(Q(\mathbf x)\), it turns out that we actually know what \(D_{\Phi}(\mathbf x)\) will be!
\[ \text{max } \frac{1}{2} E_{P(x)}[\log D_{\phi}(\mathbf x)] + \frac{1}{2} E_{Q(x)}[ \log(1 - D_{\phi}(\mathbf x))] = \]
\[ \text{max } \frac{1}{2} \int_x P(x)[\log D_{\phi}(\mathbf x)] + Q(x)[ \log(1 - D_{\phi}(\mathbf x))] dx \]
Because this is an integral, it suffices to maximize the integrand for each \(\mathbf x\)
Assuming we know and can evaluate \(P(\mathbf x)\) and \(Q(\mathbf x)\), we just need to find \(D_{\phi}(\mathbf x)\) that maximizes:
\[ P(x)[\log D_{\phi}(\mathbf x)] + Q(x)[ \log(1 - D_{\phi}(\mathbf x))] \]
Omitting the simple calculus (taking the derivative w.r.t. \(D(\mathbf x)\) ), we find that the optimal critic is:
\[ D_{\Phi}(\mathbf x) = \frac{P(\mathbf x)}{P(\mathbf x) + Q(\mathbf x)} \]
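This pointwise claim is easy to check numerically. With made-up density values \(P(\mathbf x)\) and \(Q(\mathbf x)\) at a fixed \(\mathbf x\), the claimed optimum beats every other critic value on a grid:

```python
import numpy as np

# hypothetical density values at a fixed x
P_x, Q_x = 0.7, 0.2

def objective(D):
    # the pointwise integrand: P(x) log D + Q(x) log(1 - D)
    return P_x * np.log(D) + Q_x * np.log(1 - D)

D_star = P_x / (P_x + Q_x)   # the claimed optimal critic

grid = np.linspace(0.01, 0.99, 99)
best_on_grid = objective(grid).max()
```

Since the integrand is concave in \(D\), `objective(D_star)` is at least `best_on_grid` for any grid, which is exactly the calculus result stated above.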
A neat result: if we knew \(P(\mathbf x)\) (distribution where y = 1) and \(Q(\mathbf x)\) (distribution where y = 0) for a classifier under cross entropy loss, then we wouldn’t need to optimize! This is the optimal classifier that maximizes the log-likelihood!
This is how LDA/QDA/Naive Bayes work.
Plugging this into our equation and adding/subtracting \(\log 2\) (because reasons):
\[ V(P,Q) = \frac{1}{2} E_{P(x)}\left[\log \frac{P(\mathbf x)}{\frac{1}{2}(Q(\mathbf x) + P(\mathbf x))}\right] + \frac{1}{2} E_{Q(x)}\left[ \log\frac{Q(\mathbf x)}{\frac{1}{2}(Q(\mathbf x) + P(\mathbf x))}\right] - \log 2 \]
These two expectations are special - they are KL Divergences:
\[ D_{KL}(P || Q) = \int P(\mathbf x) \log \frac{P(\mathbf x)}{Q(\mathbf x)} dx = E_{P(x)} \left[ \log \frac{P(\mathbf x)}{Q(\mathbf x)} \right] \]
\[ V(P,Q) = \frac{1}{2} D_{KL}\left(P(\mathbf x) || \frac{1}{2}(Q(\mathbf x) + P(\mathbf x)) \right) + \frac{1}{2} D_{KL}\left(Q(\mathbf x) || \frac{1}{2}(Q(\mathbf x) + P(\mathbf x)) \right) - \log 2 \]
We’re finding the distance of our two distributions w.r.t. a common midpoint
This is a special divergence called the Jensen-Shannon divergence between \(P(\mathbf x)\) and \(Q(\mathbf x)\)
Like KL, it tells us how far two distributions are from one another
However, it is symmetric
Sort of unimportant here, but cool to know!
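For discrete distributions, the JSD and its symmetry are a few lines of numpy (the distributions below are made-up examples):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    # JSD: average KL of each distribution to the midpoint mixture
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# two example distributions over 3 outcomes
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

kl_forward, kl_backward = kl(p, q), kl(q, p)   # generally unequal
jsd_forward, jsd_backward = jsd(p, q), jsd(q, p)  # always equal
```

The JSD is also bounded by \(\log 2\), which matches the \(- \log 2\) constant floating around in the derivation: \(V(P,Q)\) sits between \(-\log 2\) (indistinguishable) and \(0\) (perfectly separable).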
The loss for the optimal classifier (e.g. minimal loss) for any \(\mathbf x\) and \(y\) under cross entropy loss (e.g. finding the parameters that maximize the likelihood) is equivalent to the Jensen-Shannon divergence between the two data generating distributions!
Recall that the goal of a GAN is to find some \(Q(\mathbf x)\) that generates samples that are statistically indistinguishable from the ground truth sample generated by \(P(\mathbf x)\)
We can’t alter the ground truth
We can alter the proposals!
Let \(\boldsymbol \Theta\) be a set of values that parameterize the proposal distribution, \(Q(\mathbf x | \boldsymbol \Theta)\).
With \(N\) samples from \(P(\mathbf x)\) and \(N\) samples from \(Q(\mathbf x | \boldsymbol \Theta)\), we can estimate the JSD by maximizing the log-likelihood of a classifier trained to discriminate between \(P(\mathbf x)\) (the real instances) and \(Q(\mathbf x)\) (the fake instances)
Therefore, it is optimal in the generative sense to find \(\boldsymbol \Theta\) that minimizes the Jensen-Shannon Divergence!
This is a bit confusing, so let’s look at some examples.
This example assumes simple, well-behaved one-dimensional probability distributions
Let’s treat this like a decoder from a VAE.
Let \(\pi(\mathbf z)\) be an easy to work with prior over a latent space in \(K\) dimensions
Then \(\mathbf z' \sim \pi(\mathbf z)\)
\(\hat{\mathbf x} = g(\mathbf z')\)
If \(g()\) is a function learned via a neural network, then we’ve created an arbitrarily complex mapping of simple \(\mathbf z\) to a complex approximate distribution!
This is referred to as a generator network
The GAN objective:
Let \(Q(\mathbf z) = \mathcal N_K(\mathbf z | \mathbf 0 , \mathcal I_K)\) and \(g(\mathbf z)\) be a function learned via a neural network with parameters \(\boldsymbol \theta\)
Let \(\boldsymbol \phi\) be values that parameterize a discriminator network that seeks to find sufficiently flexible MLE classifiers to discriminate between the real and fake data.
Train a generator network to find \(\boldsymbol \theta\) that minimizes the JSD between the generator and the ground truth while simultaneously training a discriminator network to find \(\boldsymbol \phi\) that maximizes the log-likelihood of the discriminator model.
\[ \underset{\theta}{\text{min }} \underset{\phi}{\text{max }} \frac{1}{2} E_{P(\mathbf x)}[\log D_{\phi} (\mathbf x)] + \frac{1}{2} E_{Q(\mathbf z)}[\log(1 - D_{\phi}(g_{\theta}(\mathbf z)))] \]
This is the classic GAN objective (the f-GAN framework generalizes it to other divergences)
GANs approach the problem of generation as a minimax game
Train a model to produce passable fakes
Train another model to discriminate as well as possible
Update the generator given the performance of the discriminator and vice versa!
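The alternating updates can be sketched end-to-end on a 1-D toy problem. This is a hand-derived sketch, not a practical GAN: real data are \(N(2, 0.5)\), the generator is linear, the discriminator is logistic, and the generator uses the common "non-saturating" heuristic (maximize \(\log D(\text{fake})\) rather than minimize \(\log(1 - D(\text{fake}))\)):

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# toy 1-D GAN: real data ~ N(2, 0.5); generator g(z) = w_g * z + b_g, z ~ N(0, 1)
w_g, b_g = 1.0, 0.0     # generator parameters (theta)
w_d, b_d = 0.0, 0.0     # discriminator parameters (phi)
lr, N = 0.05, 256

for step in range(2000):
    x_real = rng.normal(2.0, 0.5, N)
    z = rng.standard_normal(N)
    x_fake = w_g * z + b_g

    # discriminator ascent: maximize mean log D(real) + mean log(1 - D(fake))
    p_real = sigmoid(w_d * x_real + b_d)
    p_fake = sigmoid(w_d * x_fake + b_d)
    w_d += lr * (np.mean((1 - p_real) * x_real) - np.mean(p_fake * x_fake))
    b_d += lr * (np.mean(1 - p_real) - np.mean(p_fake))

    # generator ascent on the non-saturating objective: maximize mean log D(fake)
    p_fake = sigmoid(w_d * x_fake + b_d)
    w_g += lr * np.mean((1 - p_fake) * w_d * z)
    b_g += lr * np.mean((1 - p_fake) * w_d)
```

After training, the generator's offset `b_g` drifts toward 2, the mean of the real data - the discriminator's feedback is the only signal the generator ever sees.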
It’s pretty broad because it makes basically no assumptions about the structure of the true and approximate distributions!
There isn’t a loss for a GAN, per se
Instead, we have our discriminator performance (cross-entropy) and our generator performance (JSD)
Train until both values converge
In theory, training converges because the setup is a zero-sum game with a Nash equilibrium
QTM 315 coming in handy in ML2!
The lack of assumptions, though, comes at some costs:
We can’t ever really know \(P(\mathbf x)\)
We aren’t ever really going to learn \(P(\mathbf z | \mathbf x)\). There is no built-in way to encode an image and use that representation to operate on the image
There are some clever ways around this, though!